Metabarcoding exploratory data analysis

In this notebook we will analyze PacMAN pipeline results for a subset of COI samples from Rey, A., Basurko, O. C., & Rodriguez-Ezpeleta, N. (2020). Considerations for metabarcoding-based port biological baseline surveys aimed at marine nonindigenous species monitoring and risk assessments. Ecology and Evolution, 10(5), 2452–2465. https://doi.org/10.1002/ece3.6071

Running this notebook

To run this notebook in its entirety or to have access to the data files used in this notebook, clone or download and extract the iobis/pacman-pipeline-training repository. The notebook can be found in the datasets/rey/analysis directory, open this directory in RStudio after downloading (File > Open Project or navigate to Desktop in the Files panel and select More > Set As Working Directory).

If you have trouble locating the Desktop folder in RStudio, set your Desktop as the default working directory using Tools > Global Options > Default working directory.

Pipeline report

The repository you downloaded also contains the report generated by the pipeline, see datasets/rey/results_noblast/report.html.

Importing data

The PacMAN pipeline exports results as a phyloseq object which can be imported into R for analysis. First install the phyloseq package from Bioconductor:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("phyloseq")

This is how phyloseq objects are structured:

Read the phyloseq object and verify that we have:

  • a OTU table
  • a sample table
  • a taxonomy table

A phylogenetic tree has not been calculated so that slot is empty.

physeq <- readRDS("../results_noblast/phyloseq_object.rds")

physeq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5327 taxa and 22 samples ]
## sample_data() Sample Data:       [ 22 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 5327 taxa by 14 taxonomic ranks ]

While we can access these tables individually (for example using physeq@sam_data or sample_data(physeq)), the phyloseq package also has a function psmelt() to convert the phyloseq object into a single large data frame:

library(dplyr)
library(phyloseq)

df <- psmelt(physeq) %>%
  # convert empty strings to NA across all character columns
  mutate_if(is.character, ~na_if(., "")) %>%
  # convert to tibble for prettier printing
  as_tibble()

df
## # A tibble: 117,194 × 37
##    OTU    Sample Abund…¹ eventID mater…² event…³ verba…⁴ local…⁵ verba…⁶ event…⁷
##    <chr>  <chr>    <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 asv.2  SAMN1…   25147 SAMN10… SAMN10… settle… 43.28 … Spain:… 7m      2017-1…
##  2 asv.7  SAMN1…   16275 SAMN10… SAMN10… settle… 43.28 … Spain:… 7m      2017-0…
##  3 asv.8  SAMN1…   15862 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m      2017-0…
##  4 asv.6  SAMN1…   15401 SAMN10… SAMN10… filter… 43.34 … Spain:… surface 2017-0…
##  5 asv.5  SAMN1…   12944 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m      2017-1…
##  6 asv.11 SAMN1…   12292 SAMN10… SAMN10… settle… 43.34 … Spain:… 7m      2017-0…
##  7 asv.14 SAMN1…   11142 SAMN10… SAMN10… filter… 43.37 … Spain:… surface 2017-0…
##  8 asv.10 SAMN1…   10447 SAMN10… SAMN10… filter… 43.37 … Spain:… surface 2017-0…
##  9 asv.3  SAMN1…   10377 SAMN10… SAMN10… settle… 43.34 … Spain:… 7m      2017-1…
## 10 asv.18 SAMN1…    9850 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m      2017-0…
## # … with 117,184 more rows, 27 more variables: occurrenceStatus <chr>,
## #   target_gene <chr>, subfragment <chr>, pcr_primer_forward <chr>,
## #   pcr_primer_reverse <chr>, pcr_primer_name_forw <chr>,
## #   pcr_primer_name_reverse <chr>, pcr_primer_reference <chr>,
## #   lib_layout <chr>, seq_meth <chr>, sop <chr>, votu_db <chr>, My_name <chr>,
## #   kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
## #   genus <chr>, species <chr>, lastvalue <chr>, otu_seq_comp_appr <chr>, …

Exploratory data analysis

Abundance by phylum

Let’s first determine which are the most abundant phyla across all samples. We will use this information to bin the less abundant phyla into a category Other for simplifying some of our graphics.

stats_phyla <- df %>%
  group_by(phylum) %>%
  summarize(abundance = sum(Abundance)) %>%
  arrange(desc(abundance))

top_phyla <- stats_phyla$phylum %>% na.omit() %>% head(8)
top_phyla
## [1] "Arthropoda"      "Cnidaria"        "Bacillariophyta" "Annelida"       
## [5] "Bryozoa"         "Nemertea"        "Chlorophyta"     "Mollusca"

Now create a color scale for these phyla using the RColorBrewer package:

phylum_colors <- RColorBrewer::brewer.pal(9, "Spectral")
names(phylum_colors) <- c(top_phyla, "Other")
phylum_colors
##      Arthropoda        Cnidaria Bacillariophyta        Annelida         Bryozoa 
##       "#D53E4F"       "#F46D43"       "#FDAE61"       "#FEE08B"       "#FFFFBF" 
##        Nemertea     Chlorophyta        Mollusca           Other 
##       "#E6F598"       "#ABDDA4"       "#66C2A5"       "#3288BD"

And add a new column with binned phyla:

df <- df %>%
  mutate(phylum_binned = ifelse(is.na(phylum) | phylum %in% top_phyla, phylum, "Other")) %>%
  mutate(phylum_binned = factor(phylum_binned, levels = c(top_phyla, "Other")))

Let’s also add the binned phyla to the phyloseq object for later use:

# remotes::install_github("david-barnett/microViz")

physeq <- physeq %>%
  microViz::tax_mutate(phylum = ifelse(phylum == "", NA, phylum)) %>%
  microViz::tax_mutate(phylum_binned = ifelse(is.na(phylum) | phylum %in% top_phyla, phylum, "Other"))

Abundance by sample type and phylum

First list the abundance by phylum and sample type:

library(tidyr)

stats_type_phylum <- df %>%
  group_by(eventRemarks, phylum) %>%
  summarize(abundance = sum(Abundance))

spread(stats_type_phylum, eventRemarks, abundance) %>%
  mutate(total = `settlement plates` + `filtered water`) %>%
  arrange(desc(total)) %>%
  knitr::kable()
phylum filtered water settlement plates total
NA 353725 126557 480282
Arthropoda 13857 99041 112898
Cnidaria 3603 60871 64474
Bacillariophyta 45861 979 46840
Annelida 8706 15085 23791
Bryozoa 454 22631 23085
Nemertea 7 20234 20241
Chlorophyta 17922 10 17932
Mollusca 3801 6219 10020
Haptista 4682 13 4695
Rhodophyta 3082 197 3279
Rotifera 1926 56 1982
Oomycota 1303 414 1717
Basidiomycota 1481 18 1499
Chordata 1213 211 1424
Platyhelminthes 11 1382 1393
Prasinodermophyta 1071 0 1071
Echinodermata 767 53 820
Ascomycota 733 3 736
Streptophyta 444 0 444
Porifera 281 115 396
Discosea 173 63 236
Nematoda 21 95 116
Placozoa 96 0 96
Phoronida 57 18 75
Evosea 62 7 69
Gastrotricha 6 32 38
Picozoa 28 0 28
Chytridiomycota 18 0 18
Sipuncula 12 0 12
Imbricatea 0 7 7
Chaetognatha 5 0 5
Entoprocta 0 4 4
Tubulinea 3 0 3
Onychophora 2 0 2
Zoopagomycota 2 0 2

Now create a graph using the binned phyla:

library(ggplot2)

stats_type_phylum <- df %>%
  # calculate total abundance by sample type
  group_by(eventRemarks) %>%
  mutate(abundance = Abundance / sum(Abundance)) %>%
  group_by(eventRemarks, phylum_binned) %>%
  summarize(abundance = sum(abundance))
  
ggplot() +
  geom_bar(data = stats_type_phylum, aes(y = eventRemarks, fill = phylum_binned, x = abundance), stat = "identity") +
  scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
  theme_minimal()

Abundance (reads) by sample and phylum (optional)

stats_sample_phylum <- df %>%
  # calculate total abundance by sample type
  group_by(Sample) %>%
  mutate(relative_abundance = Abundance / sum(Abundance)) %>%
  group_by(Sample, eventRemarks, phylum_binned) %>%
  summarize(relative_abundance = sum(relative_abundance), abundance = sum(Abundance))

Absolute reads abundance:

ggplot() +
  geom_bar(data = stats_sample_phylum, aes(y = Sample, fill = phylum_binned, x = abundance), stat = "identity") +
  scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())

Relative reads abundance:

ggplot() +
  geom_bar(data = stats_sample_phylum, aes(y = Sample, fill = phylum_binned, x = relative_abundance), stat = "identity") +
  scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())

Most abundant species

Let’s list the species with the highest relative abundance across all samples:

top_species <- df %>%
  filter(!is.na(species)) %>%
  group_by(species) %>%
  summarize(abundance = sum(Abundance)) %>%
  arrange(desc(abundance)) %>%
  head(20)

top_species %>% knitr::kable()
species abundance
Balanus trigonus 37896
Bougainvillia muscus 33490
Minutocellus polymorphus 27149
Obelia dichotoma 19006
Emplectonema gracile 16805
Micromonas pusilla 15598
Dictyocha speculum 13538
Bugula neritina 10243
Harpyia umbrosa 9099
Cutleria multifida 5503
Ostrea stentina 5267
Polydora neocaeca 5066
Paracalanus parvus 4374
Sabellaria spinulosa 3627
Amphibalanus eburneus 3567
Palmaria decipiens 2398
Amphinema dinema 2084
Chloroparvula pacifica 1528
Pseudochattonella farcimen 1509
Campanularia hincksii 1478

For the most abundant species, plot the abundance by sample:

stats <- df %>%
  filter(species %in% top_species$species) %>%
  group_by(species, eventRemarks, Sample) %>%
  summarize(abundance = sum(Abundance))

ggplot() +
  geom_jitter(data = stats, aes(y = species, x = abundance, color = eventRemarks), pch = 21, height = 0.1, width = 0) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal()

Invasive species

In this section we will check our results against the World Register of Introduced Marine Species (WRiMS). The repository contains a CSV file containing the LSIDs for all species in WRiMS. Read this CSV and use the LSIDs to only retain species from WRiMS.

wrims_lsids <- read.csv("wrims_lsids.csv")

invasives <- df %>%
  filter(lsid %in% wrims_lsids$lsid)

invasives %>%
  group_by(phylum, species) %>%
  summarize(abundance = sum(Abundance)) %>%
  arrange(desc(abundance)) %>%
  rmarkdown::paged_table()

Alpha diversity

Use vegan::plot_richness() to visualize the number of ASVs per sample:

plot_richness(physeq, measures = c("Observed"), color = "eventRemarks")

The vegan package can also calculate a number of biodiversity indices. But keep in mind that some of these will use the reads abundance which does not necessarily correlate with actual abundance in terms of individuals, such as the Chao1 estimator which uses the numbers of singletons and doubletons to obtain a lower bound for the expected asymptotic species richness, or Shannon entropy which uses relative abundance.

estimate_richness(physeq)
##              Observed    Chao1    se.chao1      ACE    se.ACE  Shannon
## SAMN10847219      731 731.0612 0.262882645 731.6197 13.507301 5.231046
## SAMN10847223      654 654.0096 0.100160727 654.5405 11.803509 4.204983
## SAMN10847226      529 529.1538 0.422097183 530.5108 11.452192 4.085992
## SAMN10847228      363 363.2500 0.580982709 363.9410  9.507433 3.291294
## SAMN10847229      407 407.0256 0.169157714 407.5179 10.070721 3.469086
## SAMN10847231      719 719.0000 0.010862004 719.2273 13.321963 4.349963
## SAMN10847249      284 284.0000 0.013864415 284.2858  8.422944 3.337518
## SAMN10847251      483 483.0182 0.140266223 483.5069 10.838516 4.407202
## SAMN10847253      372 372.0588 0.257092009 372.8648  9.601941 2.782583
## SAMN10847255      407 407.0000 0.014268153 407.2320 10.079683 3.696802
## SAMN10847264      258 258.0000 0.000000000 258.0000  6.489407 3.189349
## SAMN10847265      337 337.0000 0.008052542 337.2954  8.390622 4.633680
## SAMN10847266      392 392.0000 0.000000000 392.0000  8.630747 3.276873
## SAMN10847267      211 211.0000 0.020783907 211.2352  7.158267 2.975029
## SAMN10847272      311 311.0000 0.013136724 311.2671  8.813702 3.695959
## SAMN10847276      687 687.0000 0.005613887 687.2582 12.784487 4.464019
## SAMN10847333      298 298.0000 0.000000000 298.0000  8.631144 3.434833
## SAMN10847336      193 193.0000 0.000000000 193.0000  6.877914 2.778169
## SAMN10847345      399 399.0000 0.000000000 399.0000  9.948993 3.479036
## SAMN10847348      289 289.0000 0.000000000 289.0000  8.320983 3.581685
## SAMN10847357      130 130.0000 0.000000000 130.0000  5.395155 2.745419
## SAMN10847359      216 216.0000 0.000000000 216.0000  7.328281 2.052828
##                Simpson InvSimpson    Fisher
## SAMN10847219 0.9853373  68.200038 137.79902
## SAMN10847223 0.9587094  24.218612 118.09885
## SAMN10847226 0.9419372  17.222718  90.67119
## SAMN10847228 0.8741244   7.944350  57.16947
## SAMN10847229 0.9131446  11.513387  60.37124
## SAMN10847231 0.9441269  17.897686 115.89380
## SAMN10847249 0.9072691  10.783896  44.15926
## SAMN10847251 0.9713287  34.878095  84.47387
## SAMN10847253 0.6952500   3.281378  60.55650
## SAMN10847255 0.9371758  15.917441  62.94423
## SAMN10847264 0.8099860   5.262770  55.31375
## SAMN10847265 0.9771990  43.857680  75.87903
## SAMN10847266 0.8726964   7.855237  74.01503
## SAMN10847267 0.8690806   7.638287  32.16280
## SAMN10847272 0.9359133  15.603862  50.10914
## SAMN10847276 0.9645337  28.195777 117.72100
## SAMN10847333 0.9222313  12.858640  39.56879
## SAMN10847336 0.8182685   5.502624  27.12397
## SAMN10847345 0.9346967  15.313164  54.47372
## SAMN10847348 0.9344302  15.250929  39.89149
## SAMN10847357 0.8898749   9.080580  15.58998
## SAMN10847359 0.6503482   2.859988  29.61067

ASV accumulation curves

Rarefaction is an interpolation of species richness to smaller samples sizes to allow comparisons between samples.

vegan::rarecurve(t(otu_table(physeq)), step = 1000)

Do it yourself (optional)

Alternatively, run vegan::rarefy() for every sample and construct the graph yourself:

community <- t(otu_table(physeq))

curves <- purrr::map(physeq@sam_data$eventID, function(sample) {
  community_row <- community[sample,]
  sample_sizes <- seq(1000, sum(community_row), by = 1000)
  rarefied <- vegan::rarefy(community_row, sample_sizes)
  curve <- data.frame(
    eventID = sample,
    sample_size = sample_sizes,
    asvs = as.numeric(rarefied)
  ) %>%
    left_join(physeq@sam_data, by = "eventID")
}) %>%
  bind_rows()

samples <- curves %>%
  group_by(eventID) %>%
  summarize(sample_size = max(sample_size), asvs = max(asvs))

ggplot() +
  geom_line(data = curves %>% group_by(eventID), aes(x = sample_size, y = asvs, color = eventRemarks, group = eventID)) +
  geom_label(data = samples, aes(sample_size, asvs, label = eventID), size = 2) +
  scale_color_brewer(palette = "Set1") +
  theme_classic()

Beta diversity

Multidimensional scaling with phyloseq

Phyloseq’s ordinate() function allows us to do MultiDimensional Scaling (MDS) or Principal Coordinate Analysis (PCoA) on our dataset. The default for this function is to use Bray-Curtis Dissimilarity to calculate a distance matrix.

ord <- ordinate(physeq = physeq, method = "PCoA", distance = "bray")

plot_ordination(physeq = physeq, ordination = ord, type = "samples", color = "eventRemarks", shape = "locality") +
  geom_point(aes(color = eventRemarks), alpha = 1, size = 4) +
  geom_text(aes(label = materialSampleID), alpha = 0.7, size = 3, vjust = 2) +
  stat_ellipse(aes(group = eventRemarks)) +
  scale_color_brewer(palette = "Set1") +
  scale_shape(solid = TRUE) +
  theme_classic()

Adding taxa (optional)

We can also add taxa to the ordination plot, but styling this is a bit awkward as samples and taxa are treated as a single dataset.

plot_ordination(physeq = subset_taxa(physeq, !is.na(phylum)), ordination = ord, type = "biplot", color = "phylum_binned") +
  scale_color_manual(values = phylum_colors, na.value = "#000000") +
  geom_text(aes(label = materialSampleID), alpha = 0.7, size = 3, vjust = 2) +
  theme_classic()

Non-metric MDS (NMDS) with phyloseq (optional)

Another ordination method is non-metric MDS (NMDS). Let’s try NMDS with our dataset aggregated at the species level, and plot the most abundant species together with the sites.

First aggregates the phyloseq object to the species level:

physeq_species <- tax_glom(physeq, taxrank = "species", NArm = TRUE, bad_empty = c(NA, ""))

Then perform the ordination with method NMDS:

ord_species <- ordinate(physeq_species, method = "NMDS", distance = "bray")

Extract the NMDS values for the sites and species, and join with the relevant tables from the phyloseq object to get the necessary metadata for plotting (such as species name, locality, and sample type):

sites <- ord_species$points %>%
  as.data.frame() %>%
  tibble::rownames_to_column("site") %>%
  inner_join(physeq_species@sam_data %>% as.data.frame() %>% tibble::rownames_to_column("site"), by = "site")

species <- ord_species$species %>%
  as.data.frame() %>%
  tibble::rownames_to_column("asv") %>%
  left_join(physeq_species@tax_table %>% as.data.frame() %>% tibble::rownames_to_column("asv"), by = "asv") %>%
  filter(species %in% top_species$species)
ggplot() +
  geom_segment(data = species, aes(x = 0, y = 0, xend = MDS1, yend = MDS2), arrow = arrow(length = unit(2, "mm")), color = "#aaaaaa") +
  geom_point(data = sites, aes(MDS1, MDS2, color = eventRemarks, shape = locality), size = 3, stroke = 1.5) +
  geom_text(data = species, aes(MDS1, MDS2, label = species), size = 3, vjust = "outward") +
  scale_color_brewer(palette = "Set1") +
  scale_shape(solid = FALSE) +
  theme_classic()